** Este documento no contiene los resultados correctos, sino un ejmeplo de como tratar los datos
Lending Club es una de las empresas P2P más grandes, publica regularmente conjuntos de datos anonimizados de sus clientes que contienen características de los préstamos y los prestatarios. Esta información se puede utilizar para clasificar si un prestatario incumplirá o no con su préstamo.
Por ejemplo, los bancos suelen tener datos informativos para crear modelos que ayuden a decidir a quién conceder o denegar un préstamo. Es un problema de clasificación supervisado.
Se desea predecir, antes de conceder un préstamo a una persona, la probabilidad de que no se devuelva completamente. Por tanto, todas las variables del dataset que se empleen para el modelo, deben poderse utilizar en el momento de su llamada. Para ello, se empleará un algoritmo de clasificación supervisado.
Los pasos a realizar son:
1. Análisis inicial de los datos y preprocesamiento inicial
2. Correlaciones, tratamiento de missing y outliers
4. Tratamiento de variables categoricas: encoding
5. Aplicación de algoritmos
6. Evaluación con la muestra de test
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import plotly.express as px
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 5000)
def dame_variables_categoricas(dataset=None):
'''
----------------------------------------------------------------------------------------------------------
Función dame_variables_categoricas:
----------------------------------------------------------------------------------------------------------
-Descripción: Función que recibe un dataset y devuelve una lista con los nombres de las
variables categóricas
-Inputs:
-- dataset: Pandas dataframe que contiene los datos
-Return:
-- lista_variables_categoricas: lista con los nombres de las variables categóricas del
dataset de entrada con menos de 100 valores diferentes
-- 1: la ejecución es incorrecta
'''
if dataset is None:
print(u'\nFaltan argumentos por pasar a la función')
return 1
lista_variables_categoricas = []
other = []
for i in dataset.columns:
if (dataset[i].dtype!=float) & (dataset[i].dtype!=int):
unicos = int(len(np.unique(dataset[i].dropna(axis=0, how='all'))))
if unicos < 100:
lista_variables_categoricas.append(i)
else:
other.append(i)
return lista_variables_categoricas, other
path_folder = "./data/"
pd_loan_p1 = pd.read_csv(path_folder +"Loan_training_set_1_4.csv", skiprows=1, low_memory=False)
pd_loan_p2 = pd.read_csv(path_folder +"Loan_training_set_2_4.csv", skiprows=1, low_memory=False)
pd_loan_p3 = pd.read_csv(path_folder +"Loan_training_set_3_4.csv", skiprows=1, low_memory=False)
pd_loan_p4 = pd.read_csv(path_folder +"Loan_training_set_4_4.csv", skiprows=1, low_memory=False)
# comprobacion de que son las mismas columnas
print(len(pd_loan_p1.columns),
len(pd_loan_p2.columns),
len(pd_loan_p3.columns),
len(pd_loan_p4.columns))
print(set(pd_loan_p1.columns==pd_loan_p2.columns),
set(pd_loan_p2.columns==pd_loan_p3.columns),
set(pd_loan_p3.columns==pd_loan_p4.columns))
151 151 151 151
{True} {True} {True}
El conjunto de datos contiene variables que no estan disponibles para Lending Club en el momento en que un prestatario presenta una solicitud de préstamo en su plataforma. Estas deben eliminarse para realizar modelos ya que se estaría introduciendo información que el modelo en el momento en que lo queremos aplicar no tendría.
list_future_variables = ["acc_now_delinq", "acc_open_past_24mths", "avg_cur_bal",
"bc_open_to_buy", "bc_util", "chargeoff_within_12_mths",
"collection_recovery_fee", "collections_12_mths_ex_med", "debt_settlement_flag",
"delinq_2yrs", "delinq_amnt", "disbursement_method", "funded_amnt",
"funded_amnt_inv", "hardship_flag", "inq_last_6mths", "last_credit_pull_d",
"last_fico_range_high", "last_fico_range_low", "last_pymnt_amnt",
"last_pymnt_d", "mo_sin_rcnt_rev_tl_op", "mo_sin_rcnt_tl", "mths_since_recent_bc",
"mths_since_recent_inq","num_accts_ever_120_pd", "num_actv_bc_tl", "num_actv_rev_tl",
"num_bc_sats", "num_bc_tl", "num_il_tl", "num_op_rev_tl", "num_rev_accts",
"num_rev_tl_bal_gt_0", "num_sats", "num_tl_120dpd_2m", "num_tl_30dpd",
"num_tl_90g_dpd_24m", "num_tl_op_past_12m", "out_prncp", "out_prncp_inv",
"pct_tl_nvr_dlq", "percent_bc_gt_75", "pymnt_plan", "recoveries", "tax_liens",
"tot_coll_amt", "tot_cur_bal", "tot_hi_cred_lim", "total_bal_ex_mort",
"total_bc_limit", "total_il_high_credit_limit", "total_pymnt", "total_pymnt_inv",
"total_rec_int", "total_rec_late_fee", "total_rec_prncp", "total_rev_hi_lim",
"issue_d", "policy_code", "revol_bal", "next_pymnt_d"]
list_other_var = ['title']
len(list_future_variables)
62
#union de las tablas
pd_loan = pd.concat([pd_loan_p1, pd_loan_p2, pd_loan_p3, pd_loan_p4], axis=0)\
.drop(list_future_variables+list_other_var,axis=1)\
.set_index(["id", "url", "zip_code"])
pd_loan.head()
| member_id | loan_amnt | term | int_rate | installment | grade | sub_grade | emp_title | emp_length | home_ownership | annual_inc | verification_status | loan_status | desc | purpose | addr_state | dti | earliest_cr_line | fico_range_low | fico_range_high | mths_since_last_delinq | mths_since_last_record | open_acc | pub_rec | revol_util | total_acc | initial_list_status | mths_since_last_major_derog | application_type | annual_inc_joint | dti_joint | verification_status_joint | open_acc_6m | open_act_il | open_il_12m | open_il_24m | mths_since_rcnt_il | total_bal_il | il_util | open_rv_12m | open_rv_24m | max_bal_bc | all_util | inq_fi | total_cu_tl | inq_last_12m | mo_sin_old_il_acct | mo_sin_old_rev_tl_op | mort_acc | mths_since_recent_bc_dlq | mths_since_recent_revol_delinq | pub_rec_bankruptcies | revol_bal_joint | sec_app_fico_range_low | sec_app_fico_range_high | sec_app_earliest_cr_line | sec_app_inq_last_6mths | sec_app_mort_acc | sec_app_open_acc | sec_app_revol_util | sec_app_open_act_il | sec_app_num_rev_accts | sec_app_chargeoff_within_12_mths | sec_app_collections_12_mths_ex_med | sec_app_mths_since_last_major_derog | hardship_type | hardship_reason | hardship_status | deferral_term | hardship_amount | hardship_start_date | hardship_end_date | payment_plan_start_date | hardship_length | hardship_dpd | hardship_loan_status | orig_projected_additional_accrued_interest | hardship_payoff_balance_amount | hardship_last_payment_amount | debt_settlement_flag_date | settlement_status | settlement_date | settlement_amount | settlement_percentage | settlement_term | |||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | url | zip_code | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 112435993 | https://lendingclub.com/browse/loanDetail.action?loan_id=112435993 | 148xx | NaN | 2300.0 | 36 months | 12.62% | 77.08 | C | C1 | NaN | NaN | OWN | 10000.0 | Not Verified | Current | NaN | credit_card | NY | 21.61 | Sep-1985 | 700.0 | 704.0 | NaN | NaN | 4.0 | 0.0 | 55.1% | 6.0 | w | NaN | Individual | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | NaN | 1.0 | 2.0 | 2315.0 | 55.0 | 1.0 | 0.0 | 2.0 | NaN | 381.0 | 0.0 | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 112290210 | https://lendingclub.com/browse/loanDetail.action?loan_id=112290210 | 021xx | NaN | 16000.0 | 60 months | 12.62% | 360.95 | C | C1 | teacher | 10+ years | MORTGAGE | 94000.0 | Not Verified | Current | NaN | debt_consolidation | MA | 25.61 | Jun-1992 | 715.0 | 719.0 | NaN | NaN | 9.0 | 0.0 | 105.8% | 26.0 | w | NaN | Individual | NaN | NaN | NaN | 2.0 | 1.0 | 0.0 | 0.0 | 26.0 | 11078.0 | 69.0 | 3.0 | 5.0 | 1962.0 | 94.0 | 0.0 | 6.0 | 1.0 | 26.0 | 300.0 | 7.0 | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 112436985 | https://lendingclub.com/browse/loanDetail.action?loan_id=112436985 | 018xx | NaN | 6025.0 | 36 months | 15.05% | 209.01 | C | C4 | Front Office | 7 years | MORTGAGE | 46350.0 | Not Verified | Current | NaN | home_improvement | MA | 8.88 | Jun-2002 | 700.0 | 704.0 | NaN | NaN | 11.0 | 0.0 | 44.9% | 27.0 | w | NaN | Individual | NaN | NaN | NaN | 1.0 | 0.0 | 1.0 | 2.0 | 10.0 | 0.0 | NaN | 1.0 | 2.0 | 1950.0 | 45.0 | 1.0 | 5.0 | 3.0 | 86.0 | 180.0 | 2.0 | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 112439006 | https://lendingclub.com/browse/loanDetail.action?loan_id=112439006 | 913xx | NaN | 20400.0 | 36 months | 9.44% | 652.91 | B | B1 | Manager | 10+ years | RENT | 44000.0 | Source Verified | Current | NaN | car | CA | 27.06 | Jan-2007 | 760.0 | 764.0 | NaN | NaN | 15.0 | 0.0 | 18.7% | 19.0 | w | NaN | Individual | NaN | NaN | NaN | 1.0 | 10.0 | 1.0 | 1.0 | 6.0 | 53566.0 | 95.0 | 1.0 | 2.0 | 4240.0 | 60.0 | 1.0 | 1.0 | 1.0 | 125.0 | 125.0 | 0.0 | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 112438929 | https://lendingclub.com/browse/loanDetail.action?loan_id=112438929 | 560xx | NaN | 13000.0 | 36 months | 11.99% | 431.73 | B | B5 | Paramedic | 10+ years | MORTGAGE | 85000.0 | Source Verified | Current | NaN | debt_consolidation | MN | 6.79 | Feb-2002 | 660.0 | 664.0 | 16.0 | NaN | 5.0 | 0.0 | 88% | 24.0 | w | 16.0 | Individual | NaN | NaN | NaN | 0.0 | 1.0 | 0.0 | 0.0 | 127.0 | 8466.0 | 72.0 | 0.0 | 1.0 | 2996.0 | 78.0 | 0.0 | 0.0 | 0.0 | 160.0 | 184.0 | 1.0 | 26.0 | 16.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Dimensión
print(pd_loan.shape, pd_loan.drop_duplicates().shape)
(884884, 85) (884877, 85)
Tipos de datos
pd_loan.dtypes.to_dict()
{'member_id': dtype('float64'),
'loan_amnt': dtype('float64'),
'term': dtype('O'),
'int_rate': dtype('O'),
'installment': dtype('float64'),
'grade': dtype('O'),
'sub_grade': dtype('O'),
'emp_title': dtype('O'),
'emp_length': dtype('O'),
'home_ownership': dtype('O'),
'annual_inc': dtype('float64'),
'verification_status': dtype('O'),
'loan_status': dtype('O'),
'desc': dtype('O'),
'purpose': dtype('O'),
'addr_state': dtype('O'),
'dti': dtype('float64'),
'earliest_cr_line': dtype('O'),
'fico_range_low': dtype('float64'),
'fico_range_high': dtype('float64'),
'mths_since_last_delinq': dtype('float64'),
'mths_since_last_record': dtype('float64'),
'open_acc': dtype('float64'),
'pub_rec': dtype('float64'),
'revol_util': dtype('O'),
'total_acc': dtype('float64'),
'initial_list_status': dtype('O'),
'mths_since_last_major_derog': dtype('float64'),
'application_type': dtype('O'),
'annual_inc_joint': dtype('float64'),
'dti_joint': dtype('float64'),
'verification_status_joint': dtype('O'),
'open_acc_6m': dtype('float64'),
'open_act_il': dtype('float64'),
'open_il_12m': dtype('float64'),
'open_il_24m': dtype('float64'),
'mths_since_rcnt_il': dtype('float64'),
'total_bal_il': dtype('float64'),
'il_util': dtype('float64'),
'open_rv_12m': dtype('float64'),
'open_rv_24m': dtype('float64'),
'max_bal_bc': dtype('float64'),
'all_util': dtype('float64'),
'inq_fi': dtype('float64'),
'total_cu_tl': dtype('float64'),
'inq_last_12m': dtype('float64'),
'mo_sin_old_il_acct': dtype('float64'),
'mo_sin_old_rev_tl_op': dtype('float64'),
'mort_acc': dtype('float64'),
'mths_since_recent_bc_dlq': dtype('float64'),
'mths_since_recent_revol_delinq': dtype('float64'),
'pub_rec_bankruptcies': dtype('float64'),
'revol_bal_joint': dtype('float64'),
'sec_app_fico_range_low': dtype('float64'),
'sec_app_fico_range_high': dtype('float64'),
'sec_app_earliest_cr_line': dtype('O'),
'sec_app_inq_last_6mths': dtype('float64'),
'sec_app_mort_acc': dtype('float64'),
'sec_app_open_acc': dtype('float64'),
'sec_app_revol_util': dtype('float64'),
'sec_app_open_act_il': dtype('float64'),
'sec_app_num_rev_accts': dtype('float64'),
'sec_app_chargeoff_within_12_mths': dtype('float64'),
'sec_app_collections_12_mths_ex_med': dtype('float64'),
'sec_app_mths_since_last_major_derog': dtype('float64'),
'hardship_type': dtype('O'),
'hardship_reason': dtype('O'),
'hardship_status': dtype('O'),
'deferral_term': dtype('float64'),
'hardship_amount': dtype('float64'),
'hardship_start_date': dtype('O'),
'hardship_end_date': dtype('O'),
'payment_plan_start_date': dtype('O'),
'hardship_length': dtype('float64'),
'hardship_dpd': dtype('float64'),
'hardship_loan_status': dtype('O'),
'orig_projected_additional_accrued_interest': dtype('float64'),
'hardship_payoff_balance_amount': dtype('float64'),
'hardship_last_payment_amount': dtype('float64'),
'debt_settlement_flag_date': dtype('O'),
'settlement_status': dtype('O'),
'settlement_date': dtype('O'),
'settlement_amount': dtype('float64'),
'settlement_percentage': dtype('float64'),
'settlement_term': dtype('float64')}
pd_plot_loan_status = pd_loan['loan_status']\
.value_counts(normalize=True)\
.mul(100).rename('percent').reset_index()
pd_plot_loan_status_conteo = pd_loan['loan_status'].value_counts().reset_index()
pd_plot_loan_status_pc = pd.merge(pd_plot_loan_status, pd_plot_loan_status_conteo, on=['index'], how='inner')
pd_plot_loan_status_pc
| index | percent | loan_status | |
|---|---|---|---|
| 0 | Current | 47.767710 | 422685 |
| 1 | Fully Paid | 39.047279 | 345520 |
| 2 | Charged Off | 10.967299 | 97047 |
| 3 | Late (31-120 days) | 1.262098 | 11168 |
| 4 | In Grace Period | 0.622347 | 5507 |
| 5 | Late (16-30 days) | 0.329425 | 2915 |
| 6 | Default | 0.003842 | 34 |
fig = px.histogram(pd_plot_loan_status_pc, x="index", y=['percent'])
fig.show()
elimino todos los prestamos que no están finalizados
pd_loan = pd_loan[(pd_loan['loan_status']=='Fully Paid') |
(pd_loan['loan_status']=='Charged Off')]
pd_loan.shape
(442567, 85)
pd_plot_loan_status = pd_loan['loan_status']\
.value_counts(normalize=True)\
.mul(100).rename('percent').reset_index()
pd_plot_loan_status_conteo = pd_loan['loan_status'].value_counts().reset_index()
pd_plot_loan_status_pc = pd.merge(pd_plot_loan_status,
pd_plot_loan_status_conteo, on=['index'], how='inner')
pd_plot_loan_status_pc
fig = px.histogram(pd_plot_loan_status_pc, x="index", y=['percent'])
fig.show()
pd_series_null_columns = pd_loan.isnull().sum().sort_values(ascending=False)
pd_series_null_rows = pd_loan.isnull().sum(axis=1).sort_values(ascending=False)
print(pd_series_null_columns.shape, pd_series_null_rows.shape)
pd_null_columnas = pd.DataFrame(pd_series_null_columns, columns=['nulos_columnas'])
pd_null_filas = pd.DataFrame(pd_series_null_rows, columns=['nulos_filas'])
pd_null_filas['target'] = pd_loan['loan_status'].copy()
pd_null_columnas['porcentaje_columnas'] = pd_null_columnas['nulos_columnas']/pd_loan.shape[0]
pd_null_filas['porcentaje_filas']= pd_null_filas['nulos_filas']/pd_loan.shape[1]
# # pd_null
(85,) (442567,)
pd_loan.shape
(442567, 85)
pd_null_columnas
| nulos_columnas | porcentaje_columnas | |
|---|---|---|
| member_id | 442567 | 1.000000 |
| orig_projected_additional_accrued_interest | 442522 | 0.999898 |
| hardship_dpd | 442306 | 0.999410 |
| hardship_type | 442306 | 0.999410 |
| hardship_reason | 442306 | 0.999410 |
| deferral_term | 442306 | 0.999410 |
| hardship_amount | 442306 | 0.999410 |
| hardship_start_date | 442306 | 0.999410 |
| hardship_end_date | 442306 | 0.999410 |
| payment_plan_start_date | 442306 | 0.999410 |
| hardship_length | 442306 | 0.999410 |
| hardship_status | 442306 | 0.999410 |
| hardship_loan_status | 442306 | 0.999410 |
| hardship_payoff_balance_amount | 442306 | 0.999410 |
| hardship_last_payment_amount | 442306 | 0.999410 |
| sec_app_mths_since_last_major_derog | 442294 | 0.999383 |
| sec_app_revol_util | 441782 | 0.998226 |
| sec_app_fico_range_high | 441768 | 0.998195 |
| sec_app_earliest_cr_line | 441768 | 0.998195 |
| sec_app_inq_last_6mths | 441768 | 0.998195 |
| sec_app_mort_acc | 441768 | 0.998195 |
| sec_app_open_acc | 441768 | 0.998195 |
| sec_app_open_act_il | 441768 | 0.998195 |
| sec_app_num_rev_accts | 441768 | 0.998195 |
| sec_app_chargeoff_within_12_mths | 441768 | 0.998195 |
| sec_app_collections_12_mths_ex_med | 441768 | 0.998195 |
| revol_bal_joint | 441768 | 0.998195 |
| sec_app_fico_range_low | 441768 | 0.998195 |
| annual_inc_joint | 441550 | 0.997702 |
| dti_joint | 441550 | 0.997702 |
| verification_status_joint | 441550 | 0.997702 |
| settlement_percentage | 434435 | 0.981625 |
| settlement_amount | 434435 | 0.981625 |
| settlement_date | 434435 | 0.981625 |
| settlement_status | 434435 | 0.981625 |
| debt_settlement_flag_date | 434435 | 0.981625 |
| settlement_term | 434435 | 0.981625 |
| desc | 428438 | 0.968075 |
| il_util | 421002 | 0.951273 |
| mths_since_rcnt_il | 418344 | 0.945267 |
| all_util | 417786 | 0.944006 |
| open_acc_6m | 417783 | 0.943999 |
| open_act_il | 417783 | 0.943999 |
| open_il_12m | 417783 | 0.943999 |
| open_il_24m | 417783 | 0.943999 |
| total_bal_il | 417783 | 0.943999 |
| open_rv_24m | 417783 | 0.943999 |
| inq_last_12m | 417783 | 0.943999 |
| total_cu_tl | 417783 | 0.943999 |
| inq_fi | 417783 | 0.943999 |
| open_rv_12m | 417783 | 0.943999 |
| max_bal_bc | 417783 | 0.943999 |
| mths_since_last_record | 359529 | 0.812372 |
| mths_since_recent_bc_dlq | 326642 | 0.738062 |
| mths_since_last_major_derog | 312697 | 0.706553 |
| mths_since_recent_revol_delinq | 283394 | 0.640341 |
| mths_since_last_delinq | 215680 | 0.487339 |
| emp_title | 25015 | 0.056523 |
| emp_length | 23874 | 0.053944 |
| mo_sin_old_il_acct | 12680 | 0.028651 |
| revol_util | 214 | 0.000484 |
| dti | 18 | 0.000041 |
| loan_status | 0 | 0.000000 |
| term | 0 | 0.000000 |
| int_rate | 0 | 0.000000 |
| installment | 0 | 0.000000 |
| grade | 0 | 0.000000 |
| sub_grade | 0 | 0.000000 |
| mort_acc | 0 | 0.000000 |
| pub_rec_bankruptcies | 0 | 0.000000 |
| home_ownership | 0 | 0.000000 |
| annual_inc | 0 | 0.000000 |
| verification_status | 0 | 0.000000 |
| total_acc | 0 | 0.000000 |
| initial_list_status | 0 | 0.000000 |
| purpose | 0 | 0.000000 |
| addr_state | 0 | 0.000000 |
| earliest_cr_line | 0 | 0.000000 |
| fico_range_low | 0 | 0.000000 |
| fico_range_high | 0 | 0.000000 |
| mo_sin_old_rev_tl_op | 0 | 0.000000 |
| application_type | 0 | 0.000000 |
| open_acc | 0 | 0.000000 |
| pub_rec | 0 | 0.000000 |
| loan_amnt | 0 | 0.000000 |
threshold=0.9
list_vars_not_null = list(pd_null_columnas[pd_null_columnas['porcentaje_columnas']<threshold].index)
pd_loan_filter_null = pd_loan.loc[:, list_vars_not_null]
pd_loan_filter_null.shape
(442567, 33)
pd_null_filas
| nulos_filas | target | porcentaje_filas | |||
|---|---|---|---|---|---|
| id | url | zip_code | |||
| 12277571 | https://lendingclub.com/browse/loanDetail.action?loan_id=12277571 | 797xx | 60 | Fully Paid | 0.705882 |
| 39439608 | https://lendingclub.com/browse/loanDetail.action?loan_id=39439608 | 917xx | 60 | Charged Off | 0.705882 |
| 29904844 | https://lendingclub.com/browse/loanDetail.action?loan_id=29904844 | 015xx | 60 | Fully Paid | 0.705882 |
| 59201300 | https://lendingclub.com/browse/loanDetail.action?loan_id=59201300 | 074xx | 60 | Charged Off | 0.705882 |
| 34492422 | https://lendingclub.com/browse/loanDetail.action?loan_id=34492422 | 104xx | 60 | Fully Paid | 0.705882 |
| ... | ... | ... | ... | ... | ... |
| 111376790 | https://lendingclub.com/browse/loanDetail.action?loan_id=111376790 | 937xx | 22 | Fully Paid | 0.258824 |
| 104328180 | https://lendingclub.com/browse/loanDetail.action?loan_id=104328180 | 891xx | 22 | Fully Paid | 0.258824 |
| 104312657 | https://lendingclub.com/browse/loanDetail.action?loan_id=104312657 | 028xx | 22 | Fully Paid | 0.258824 |
| 105151307 | https://lendingclub.com/browse/loanDetail.action?loan_id=105151307 | 290xx | 22 | Fully Paid | 0.258824 |
| 116805908 | https://lendingclub.com/browse/loanDetail.action?loan_id=116805908 | 775xx | 12 | Fully Paid | 0.141176 |
442567 rows × 3 columns
list_cat_vars, other = dame_variables_categoricas(dataset=pd_loan_filter_null)
pd_loan_filter_null[list_cat_vars] = pd_loan_filter_null[list_cat_vars].astype("category")
pd_loan_filter_null[list_cat_vars].head()
| emp_length | loan_status | term | grade | sub_grade | home_ownership | verification_status | initial_list_status | purpose | addr_state | application_type | |||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | url | zip_code | |||||||||||
| 112038251 | https://lendingclub.com/browse/loanDetail.action?loan_id=112038251 | 923xx | 6 years | Fully Paid | 36 months | A | A4 | OWN | Not Verified | w | credit_card | CA | Individual |
| 112149045 | https://lendingclub.com/browse/loanDetail.action?loan_id=112149045 | 985xx | 2 years | Fully Paid | 36 months | E | E3 | RENT | Source Verified | w | debt_consolidation | WA | Individual |
| 112052261 | https://lendingclub.com/browse/loanDetail.action?loan_id=112052261 | 750xx | 7 years | Fully Paid | 36 months | A | A4 | MORTGAGE | Not Verified | w | debt_consolidation | TX | Individual |
| 111999259 | https://lendingclub.com/browse/loanDetail.action?loan_id=111999259 | 958xx | 7 years | Fully Paid | 60 months | C | C5 | RENT | Source Verified | w | debt_consolidation | CA | Individual |
| 111808508 | https://lendingclub.com/browse/loanDetail.action?loan_id=111808508 | 026xx | 7 years | Fully Paid | 36 months | C | C5 | MORTGAGE | Verified | w | debt_consolidation | MA | Individual |
list_cat_vars
['emp_length', 'loan_status', 'term', 'grade', 'sub_grade', 'home_ownership', 'verification_status', 'initial_list_status', 'purpose', 'addr_state', 'application_type']
pd_loan_filter_null['term'].value_counts()
36 months 329347 60 months 113220 Name: term, dtype: int64
pd_loan_filter_null['emp_length'].value_counts()
10+ years 146118 2 years 39568 < 1 year 35725 3 years 35082 1 year 28727 4 years 25906 5 years 25742 8 years 22252 7 years 21809 6 years 20087 9 years 17677 Name: emp_length, dtype: int64
pd_loan_filter_null[list_cat_vars].dtypes
emp_length category loan_status category term category grade category sub_grade category home_ownership category verification_status category initial_list_status category purpose category addr_state category application_type category dtype: object
pd_loan_filter_null[other].head(10)
| emp_title | revol_util | int_rate | earliest_cr_line | |||
|---|---|---|---|---|---|---|
| id | url | zip_code | ||||
| 112038251 | https://lendingclub.com/browse/loanDetail.action?loan_id=112038251 | 923xx | Solutions Architect | 22.7% | 7.35% | Jul-1994 |
| 112149045 | https://lendingclub.com/browse/loanDetail.action?loan_id=112149045 | 985xx | Pse | 98.9% | 24.85% | Jan-2000 |
| 112052261 | https://lendingclub.com/browse/loanDetail.action?loan_id=112052261 | 750xx | Associate Director | 27.2% | 7.35% | Mar-2013 |
| 111999259 | https://lendingclub.com/browse/loanDetail.action?loan_id=111999259 | 958xx | Biller | 20% | 16.02% | Aug-2006 |
| 111808508 | https://lendingclub.com/browse/loanDetail.action?loan_id=111808508 | 026xx | cdl driver | 90% | 16.02% | May-2008 |
| 111071764 | https://lendingclub.com/browse/loanDetail.action?loan_id=111071764 | 080xx | Director Business Development | 24.3% | 30.79% | Oct-1989 |
| 112087682 | https://lendingclub.com/browse/loanDetail.action?loan_id=112087682 | 917xx | Senior Accountant | 56.5% | 11.99% | Apr-2008 |
| 111997144 | https://lendingclub.com/browse/loanDetail.action?loan_id=111997144 | 346xx | BUYER | 23.6% | 10.91% | Sep-1996 |
| 111670729 | https://lendingclub.com/browse/loanDetail.action?loan_id=111670729 | 761xx | Doctor | 0% | 9.44% | Jun-1999 |
| 111926953 | https://lendingclub.com/browse/loanDetail.action?loan_id=111926953 | 060xx | Supply Chain | 55% | 11.99% | Jan-2002 |
print(pd_loan_filter_null["emp_title"].value_counts().count())
pd_loan_filter_null["emp_title"]\
.apply(lambda x: str(x).lower().strip()).value_counts(normalize=True)#.count()
125011
nan 0.056523
teacher 0.022824
manager 0.022584
owner 0.011420
registered nurse 0.011243
...
ssg uh-60 helicopter mechanic 0.000002
program analyst 3 0.000002
database administer 0.000002
respiratory therapis 0.000002
manager hotel operations oasis 0.000002
Name: emp_title, Length: 100185, dtype: float64
pd_loan_filter_null.loc[:,["int_rate", "revol_util"]] = pd_loan_filter_null.loc[:,["int_rate", "revol_util"]]\
.apply(lambda x: x.str.replace("%", "").astype(float))
pd_loan_filter_null['emp_length'] = pd_loan_filter_null['emp_length'].str.extract('(\d+)').astype(float)
pd_loan_filter_null['term'] = pd_loan_filter_null['term'].str.extract('(\d+)').astype(float)
pd_loan_filter_null['sub_grade'] = pd_loan_filter_null['sub_grade'].str.extract('(\d+)').astype(float)
months = {'Jan' : 1, 'Feb' : 2, 'Mar' : 3, 'Apr' : 4, 'May' : 5, 'Jun' : 6,
'Jul' : 7, 'Aug' : 8, 'Sep' : 9, 'Oct' : 10, 'Nov' : 11, 'Dec' : 12}
pd_loan_filter_null['earliest_cr_line_month'] = pd_loan_filter_null['earliest_cr_line']\
.apply(lambda x: x.split('-')[0])\
.replace(months).astype(float)
pd_loan_filter_null['earliest_cr_line_year'] = pd_loan_filter_null['earliest_cr_line']\
.apply(lambda x: x.split('-')[1]).astype(float)
pd_loan_filter_null.head()
| mths_since_last_record | mths_since_recent_bc_dlq | mths_since_last_major_derog | mths_since_recent_revol_delinq | mths_since_last_delinq | emp_title | emp_length | mo_sin_old_il_acct | revol_util | dti | loan_status | term | int_rate | installment | grade | sub_grade | mort_acc | pub_rec_bankruptcies | home_ownership | annual_inc | verification_status | total_acc | initial_list_status | purpose | addr_state | earliest_cr_line | fico_range_low | fico_range_high | mo_sin_old_rev_tl_op | application_type | open_acc | pub_rec | loan_amnt | earliest_cr_line_month | earliest_cr_line_year | |||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | url | zip_code | |||||||||||||||||||||||||||||||||||
| 112038251 | https://lendingclub.com/browse/loanDetail.action?loan_id=112038251 | 923xx | 84.0 | NaN | 24.0 | NaN | 24.0 | Solutions Architect | 6.0 | 130.0 | 22.7 | 16.99 | Fully Paid | 36.0 | 7.35 | 359.26 | A | 4.0 | 2.0 | 1.0 | OWN | 153000.0 | Not Verified | 46.0 | w | credit_card | CA | Jul-1994 | 720.0 | 724.0 | 275.0 | Individual | 20.0 | 1.0 | 11575.0 | 7.0 | 1994.0 |
| 112149045 | https://lendingclub.com/browse/loanDetail.action?loan_id=112149045 | 985xx | NaN | NaN | 72.0 | NaN | 72.0 | Pse | 2.0 | 21.0 | 98.9 | 6.07 | Fully Paid | 36.0 | 24.85 | 285.70 | E | 3.0 | 0.0 | 0.0 | RENT | 50000.0 | Source Verified | 5.0 | w | debt_consolidation | WA | Jan-2000 | 685.0 | 689.0 | 172.0 | Individual | 4.0 | 0.0 | 7200.0 | 1.0 | 2000.0 |
| 112052261 | https://lendingclub.com/browse/loanDetail.action?loan_id=112052261 | 750xx | NaN | NaN | NaN | NaN | NaN | Associate Director | 7.0 | 48.0 | 27.2 | 13.12 | Fully Paid | 36.0 | 7.35 | 232.79 | A | 4.0 | 4.0 | 0.0 | MORTGAGE | 110000.0 | Not Verified | 27.0 | w | debt_consolidation | TX | Mar-2013 | 710.0 | 714.0 | 51.0 | Individual | 19.0 | 0.0 | 7500.0 | 3.0 | 2013.0 |
| 111999259 | https://lendingclub.com/browse/loanDetail.action?loan_id=111999259 | 958xx | 55.0 | NaN | NaN | NaN | NaN | Biller | 7.0 | 130.0 | 20.0 | 10.11 | Fully Paid | 60.0 | 16.02 | 243.29 | C | 5.0 | 0.0 | 2.0 | RENT | 51979.0 | Source Verified | 21.0 | w | debt_consolidation | CA | Aug-2006 | 690.0 | 694.0 | 54.0 | Individual | 15.0 | 2.0 | 10000.0 | 8.0 | 2006.0 |
| 111808508 | https://lendingclub.com/browse/loanDetail.action?loan_id=111808508 | 026xx | NaN | NaN | NaN | NaN | 17.0 | cdl driver | 7.0 | 109.0 | 90.0 | 10.86 | Fully Paid | 36.0 | 16.02 | 492.34 | C | 5.0 | 1.0 | 0.0 | MORTGAGE | 75000.0 | Verified | 10.0 | w | debt_consolidation | MA | May-2008 | 685.0 | 689.0 | 97.0 | Individual | 4.0 | 0.0 | 14000.0 | 5.0 | 2008.0 |
pd_loan_filter_null.shape
(442567, 35)
pd_loan_filter_null.drop("earliest_cr_line",axis=1).to_csv("./data_preprocessing/pd_data_initial_preprocessing.csv")